In today’s world, social media has spread widely, and people's social lives
have become deeply associated with social media use. People use it to
communicate with each other, share events and news, and even run
businesses. The huge growth in social media and the massive number of
users have lured attackers to distribute harmful content through fake accounts,
leading to a large number of people falling victim to those accounts. In this
work, we propose a mechanism for identifying fake accounts on the social
media site Twitter that uses two methods to preprocess the data and extract the
most effective features: the Spearman correlation coefficient and the
chi-square test. For classification, we used supervised machine learning
algorithms in an ensemble system (stack method), with random
forest, support vector machine, and naive Bayes algorithms in the first level
of the stack and the logistic regression algorithm as a meta classifier. The
stack ensemble system proved effective, achieving the best results when
compared to the algorithms used with it, with accuracy reaching 99%.
1. INTRODUCTION 

Social media use is becoming increasingly common, and it has become an essential part of daily life 
around the world. Besides being a means of communication, it is also considered a means of gaining fame 
and running a business. Social media sites are popular because of people’s interests in making friends, 
posting pictures, tagging individuals in group photos, sharing their ideas and opinions on popular subjects, 
maintaining good working relationships, and having a general interest in others. 

Twitter is one of the social media platforms used for cooperation and communication between users.
It was launched in 2006 [1], and in recent years, the number of users has reached millions. Users share short
messages, called tweets, of 140 characters or less, as well as pictures and videos, as the primary forms of
communication on the network. Regrettably, the growth of social communication on Twitter has drawn
the attention of cybercriminals, who exploit the trust between users to spread malicious content on the
network, resulting in a large number of victims. They create fake accounts [2] and use them to spread false
news or steal users’ accounts. Therefore, uncovering these accounts has become one of the major challenges
faced by social media sites at present [3].

A variety of methods have been proposed by researchers to classify fake accounts [4]-[6]. Some use
crowdsourcing [7], which relies on human effort to detect them; some use graph-based methods [8], [9],
which analyze network contents; and others use machine learning algorithms to classify accounts based on
specific features. Ersahin et al. [10] introduced a method of detecting fake accounts in a Twitter dataset using
the naive Bayes classification algorithm. The dataset was pre-processed with a supervised discretization
technique called entropy minimization discretization (EMD), which increased the accuracy to 90.9%.

Previous research [11] implemented a machine learning pipeline to identify fake accounts in online
social networks. Rather than making a prediction for each individual account, the framework classified
groups of fake accounts to determine whether they were generated by the same person. Several classification
algorithms were used, such as support vector machine (SVM), random forest, and deep neural
network.

A previous study [12] examined the identification of Twitter spam accounts to enhance the initial
detection of spammer classes by incorporating both managed principal component analysis (PCA) and
k-means algorithms. To detect spam on social networks, several existing features were adopted, and new
features were added to improve performance. Three classification algorithms, multi-layer perceptron (MLP),
support vector machine, and random forest, were trained. The best results were obtained with the random
forest algorithm, which had an accuracy of 96.30%.

Another previous study [13] treated the identification of fake Instagram accounts as a binary classification
problem and proposed a cost-sensitive technique for reducing the required features. The technique used a
genetic algorithm to pick the best attributes for classifying automated accounts, corrected the imbalance in
the fake account dataset using the synthetic minority over-sampling technique-nominal continuous
(SMOTE-NC) algorithm, and evaluated multiple pattern recognition methods on the pooled datasets.
Ultimately, the support vector machine and neural network-based techniques achieved the highest F1 score,
86%, for automated account detection, and the neural network achieved the best F1 score at 95%.

In this paper, Spearman's correlation coefficient and the chi-square test were used to preprocess Twitter
data and find the best features for distinguishing between fake and real accounts [14], and the min-max
normalization method was used to scale the data to the range between 0 and 1. For data classification, we
used machine learning algorithms in a stack ensemble system to increase the predictive strength of the
algorithms and achieve the highest accuracy in data classification.


2. RESEARCH METHOD 

This section discusses the proposed method for detecting fake accounts on social media, which
consists of six basic steps: dataset collection, data cleaning, feature extraction and selection, data
scaling, a classification stage based on the ensemble system (stack method), and an evaluation and
comparison stage. Figure 1 shows the phases of the technique adopted.
Figure 1. The steps of the technique adopted for the detection process 


Int J Elec & Comp Eng, Vol. 12, No. 3, June 2022: 3013-3022 


Int J Elec & Comp Eng ISSN: 2088-8708


2.1. Twitter data collection 

The Management Information Base "MIB" dataset [15] is used in this research. It consists of five
datasets obtained from Twitter, two of which represent real accounts and three of which represent fake
accounts. In total, there are 5,301 accounts with 29 features. The datasets are as follows: i) the fake project
(TFP) consists of 469 real accounts, ii) elections 2013 (E13) consists of 1,481 real accounts, iii) the
fastfollowerz dataset consists of 1,169 fake accounts, iv) the InterTwitter dataset consists of 1,337 fake
accounts, and v) the Twitter technology dataset consists of 845 fake accounts.


2.2. Data cleaning 

During the data collection process, some errors occur that lead to the loss of some data. This
loss decreases the quality of the data and thus leads to low-quality results when analyzing
and exploring them. Our combined data contains several blank fields, as shown in Figure 2, where the yellow
color denotes the empty fields. Keeping these empty fields negatively affects the classification process and
leads to inaccurate results, so this stage removes the feature columns that contain 30% or more
blank fields [16].
Figure 2. All features 
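
As an illustration only, the cleaning rule described above (dropping every feature column with 30% or more blank fields) could be sketched in Python as follows. The helper name `drop_sparse_columns` and the toy records are hypothetical and are not taken from the MIB dataset; treating `None` and the empty string as blank is also an assumption.

```python
def drop_sparse_columns(rows, threshold=0.30):
    """Drop every column whose share of blank fields is >= threshold.

    `rows` is a list of dicts mapping column name -> value; None or ""
    counts as a blank field (an assumption of this sketch).
    """
    if not rows:
        return []
    columns = list(rows[0].keys())
    n = len(rows)
    keep = []
    for col in columns:
        blanks = sum(1 for row in rows if row.get(col) in (None, ""))
        if blanks / n < threshold:
            keep.append(col)
    return [{col: row.get(col) for col in keep} for row in rows]


# Hypothetical toy records: `url` is blank in 2 of 4 rows (50%), so it is dropped.
accounts = [
    {"statuses_count": 120, "followers_count": 30, "url": None},
    {"statuses_count": 5, "followers_count": 2, "url": ""},
    {"statuses_count": 980, "followers_count": 410, "url": "http://example.com"},
    {"statuses_count": 44, "followers_count": 12, "url": "http://example.org"},
]
cleaned = drop_sparse_columns(accounts)
```

Only the columns are removed; the number of records stays the same.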


2.3. Feature extraction and selection 

Feature selection is used to determine the optimal subset of features for model creation by
eliminating irrelevant or redundant features, thereby concentrating only on the necessary features. The
purpose of this strategy is to minimize the training time for the prediction model by reducing
over-processing, to improve the model's generalizability, and to help researchers interpret the model. In the
proposed research, two methods are used to select the best features:


2.3.1. Spearman rank correlation 

The Spearman correlation coefficient is one of the filter methods used for feature selection [17].
It measures the strength and direction of the monotonic association between two quantitative variables
and takes values ranging from -1 to +1 to show the degree of correlation. When the two variables are
independent, the coefficient is zero. The result of the Spearman analysis is a table that contains the
correlation coefficients linking each variable in the dataset to the other variables. The following formula is
employed to calculate the Spearman rank correlation:


$$S = 1 - \frac{6 \sum_{i=1}^{m} d_i^2}{m(m^2 - 1)} \qquad (1)$$

where $S$ is the Spearman rank correlation, $d_i$ is the difference between the ranks of the two variables
for observation $i$, and $m$ is the number of observations.
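
To make the formula concrete, a minimal Python sketch follows. It assumes that there are no tied values, in which case the rank-difference formula is exact; the function names and the sample values are hypothetical.

```python
def ranks(values):
    """Rank values from 1 to m (assumes no ties, so the formula is exact)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation: S = 1 - 6 * sum(d_i^2) / (m * (m^2 - 1))."""
    m = len(x)
    rx, ry = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - (6 * d_squared) / (m * (m ** 2 - 1))


# Hypothetical feature values for five accounts.
statuses = [10, 250, 40, 900, 3]
followers = [5, 100, 300, 800, 1]

s_same = spearman(statuses, statuses)                    # identical ordering
s_reversed = spearman(statuses, [-v for v in statuses])  # reversed ordering
s_mixed = spearman(statuses, followers)                  # sum of d_i^2 is 2 here
```

Here `s_same` is 1, `s_reversed` is -1, and `s_mixed` evaluates to 1 - 12/120 = 0.9.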
2.3.2. Chi-square test 

The chi-square test is one of the statistical methods used to verify the independence of two
events [18]. When two features are independent, the calculated chi-square value is small compared to the
critical chi-square value, whereas a large calculated chi-square value rejects the hypothesis of independence.
A large chi-square value therefore indicates that a feature depends on the response and can be used for model
training. The chi-square formula is:


$$\chi^2_c = \sum_i \frac{(O_i - E_i)^2}{E_i} \qquad (2)$$

where $c$ is the degrees of freedom, $O_i$ are the observed values, and $E_i$ are the expected values.
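
The statistic can be illustrated with a short Python sketch. It assumes that a feature and the class label have been cross-tabulated into a contingency table, which is one common way to apply the test and not necessarily the exact procedure used in this work. The counts are hypothetical, and 3.841 is the standard critical value for one degree of freedom at the 0.05 level.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table (list of rows).

    Expected counts follow the independence hypothesis:
    E_ij = row_total_i * col_total_j / grand_total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total
            stat += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof


# Hypothetical counts: rows = feature value (0 / 1), columns = class (real / fake).
dependent = [[90, 10], [10, 90]]
independent = [[50, 50], [50, 50]]

stat_dep, dof = chi_square(dependent)
stat_ind, _ = chi_square(independent)
CRITICAL_1DOF_05 = 3.841  # critical value, 1 degree of freedom, alpha = 0.05
feature_is_informative = stat_dep > CRITICAL_1DOF_05
```

In the first table every expected count is 50, so the statistic is 4 × 40²/50 = 128 and independence is rejected; in the second table the observed and expected counts coincide and the statistic is zero.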

First, the Spearman correlation coefficient is used to find the correlation between all the features,
whether numerical or qualitative, based on the data ranks, and only the correlated features are kept. After
applying Spearman's correlation coefficient, two features of the remaining dataset, default_profile and
background_image, contain empty fields. These fields must be filled before the statistical
chi-square test can be used. To fill them, the number of occurrences of each value in the column is counted,
and the most common value is chosen to fill in the empty fields [19]. Then the chi-square test is applied to
Spearman's output to measure the dependence between each feature and the target (output) and to choose the
features that best indicate whether the account is fake or real for use in the classification process. The flow
chart in Figure 3 explains the feature extraction process.
Figure 3. Feature extraction flow chart 
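
The step that fills empty fields with the most common value can be sketched as follows. This is a minimal illustration; the helper name `fill_with_mode` and the sample column are hypothetical, and empty fields are assumed to be represented by `None`.

```python
from collections import Counter


def fill_with_mode(values):
    """Replace None entries with the most common non-missing value."""
    present = [v for v in values if v is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


# Hypothetical default_profile column with two empty fields.
default_profile = [1, None, 1, 0, None, 1]
filled = fill_with_mode(default_profile)
```

The value 1 occurs three times among the non-missing entries, so both empty fields are filled with 1.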


2.4. Data scaling (normalization) 

Feature scaling is a technique used to normalize the range of individual data variables or
features. In this step, all numeric values in the selected features are rescaled to lie between zero and one
using Min-Max normalization to increase the processing speed. In Min-Max normalization, the minimum
value of the variable is converted to zero and the maximum value is converted to one, while the rest of the
values are converted to decimal numbers between zero and one. The general formula is:


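$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A \qquad (3)$$

where $v$ is the original value, $v'$ is the normalized value, $\min_A$ and $\max_A$ are the minimum and
maximum values of feature $A$, and $\text{new\_min}_A$ and $\text{new\_max}_A$ are the bounds of the target
range (here, zero and one).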
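
A minimal Python sketch of this normalization for a single feature column is given below. The function name and the sample values are hypothetical, and mapping a constant column to the lower bound is an assumption made here to avoid division by zero.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Apply min-max normalization to one feature column."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


followers_count = [0, 50, 200, 1000]  # hypothetical raw values
scaled = min_max_scale(followers_count)
```

With these values the minimum maps to 0, the maximum maps to 1, and 50 and 200 map to 0.05 and 0.2.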


2.5. Detection model 

We used the ensemble system by feeding the features extracted from the datasets, after normalizing
them, into a stacking model. Stacking is an ensemble learning method that combines several classification or
regression models through a meta-classifier or meta-regressor to improve predictive strength [20]. As shown
in Figure 4, the base-level models are trained on the full training set, and then the meta-model is
trained on the outputs of the base-level models, which serve as its features. The algorithm below summarizes
stacking.


Stacking Algorithm

Input: training data D = {(x_i, y_i)}, i = 1, ..., m
Output: ensemble classifier H

Step 1: learn first-level classifiers
for t = 1 to T do
    learn h_t based on D
end for

Step 2: create a new prediction data set
for i = 1 to m do
    D_h = {(x'_i, y_i)}, where x'_i = {h_1(x_i), ..., h_T(x_i)}
end for

Step 3: learn a meta-classifier
learn H based on D_h

return H


Figure 4. Stack ensemble system 


The most important characteristic of the stack method is that it can benefit from the performance of
a group of well-performing models in a classification or regression task and can provide better predictions
than any individual model in the group. In our research, four of the most common learning algorithms are
trained and tested within the stack method. These algorithms
are:


2.5.1. Random forest algorithm 

Random forest (RF) is a powerful machine learning algorithm that performs both classification and
regression tasks [21], [22]. Its basic building block is the decision tree. The model is obtained by drawing
bootstrap samples from the data, one for each tree to be built, fitting a simple prediction model (a decision
tree) on each sample, and combining their outputs using the bagging ensemble learning technique to obtain
the final prediction.


2.5.2. Support vector machine algorithm 

Support vector machine (SVM) is one of the most popular supervised learning algorithms. It finds
the optimal hyperplane that separates the data points into two classes by maximizing the margin,
which is the distance from the decision surface to the closest data point [23], [24]. SVM is effective
in cases where the number of dimensions is greater than the number of samples.


2.5.3. Naïve Bayes algorithm 

Naive Bayes (NB) is a probabilistic classifier. It is based on Bayes' theorem and handles
both categorical and continuous variables [25], [26]. NB assumes that every pair of features is independent
given the class label, meaning that the presence of any particular feature in a class is
unrelated to the presence of other features. The NB equation is:


$$P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)} \qquad (4)$$
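
As a worked illustration of Bayes' theorem, the sketch below computes a posterior probability for a single binary feature. All of the probabilities are hypothetical and are not estimated from the MIB data; the evidence term P(B) is obtained from the law of total probability.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) = P(A) * P(B|A) / P(B), with P(B) from the law of total probability."""
    evidence = prior_a * p_b_given_a + (1 - prior_a) * p_b_given_not_a
    return prior_a * p_b_given_a / evidence


# Hypothetical numbers: A = "account is fake", B = "account uses the default profile".
p_fake = 0.6
p_default_given_fake = 0.5
p_default_given_real = 0.1

p_fake_given_default = posterior(p_fake, p_default_given_fake, p_default_given_real)
```

With these numbers the posterior is 0.30 / (0.30 + 0.04), roughly 0.88, so observing the feature raises the probability of the fake class above its prior of 0.6.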


2.5.4. Logistic regression algorithm 

Logistic regression (LR) is one of the machine learning algorithms used in binary classification [27].
It is a simple and commonly used algorithm that models the relationship between one dependent variable
(which we want to predict) and one or more independent variables. It uses the logistic (sigmoid) function to
estimate probabilities; to make a prediction, these probabilities are then converted into binary values by
applying a threshold. The sigmoid function is an S-shaped curve that takes any real value and maps it
into the range between zero and one.
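
The two steps, estimating a probability with the sigmoid function and converting it into a binary label, can be sketched as follows. The threshold of 0.5 is a common default and is an assumption here, as are the function names.

```python
import math


def sigmoid(z):
    """Map a real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(z, threshold=0.5):
    """Convert the estimated probability into a binary class label."""
    return 1 if sigmoid(z) >= threshold else 0


scores = (-4.0, 0.0, 4.0)  # hypothetical linear-model outputs
probabilities = [sigmoid(z) for z in scores]
labels = [predict_label(z) for z in scores]
```

A score of zero corresponds to a probability of exactly 0.5, and the curve is symmetric around that point.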

In the proposed method, the first-level algorithms of the stack (random forest, SVM, and
naive Bayes) are trained on the training set, k-fold cross-validation is performed on each of these learners
[28], and the cross-validated predictions are collected from all the first-level algorithms to serve as inputs to
the meta classifier (logistic regression). The same steps are used to generate predictions on the test set. The
accounts in the test set are classified into real and fake accounts by the model fitted on the training set.
The data was divided into training and test sets, with 75% as training data and 25% as testing data,
using stratified sampling [29] to maintain the same proportion of classes in both sets.

Default classifier parameters were used; only the random state parameter was set to 1 for
both the train-test split and the random forest algorithm to obtain stable and acceptable results. A
prediction is made for each of the three base algorithms using the dataset. The classifiers are run
with 10-fold cross-validation, in which the data is divided into ten parts; each time, the classifier is trained
on nine parts and tested on the tenth, and the training and testing process is repeated ten times.
Then, these predictions are fed into the meta-learner (logistic regression) to create the ensemble prediction.
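
A rough sketch of such a pipeline using scikit-learn is shown below. It is not the code used in this work: the data are synthetic stand-ins generated with `make_classification`, the Gaussian variant of naive Bayes is an assumption, and fitting the scaler on the training portion only is a design choice made here, since the text does not state whether scaling was applied before or after the split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the five selected features (NOT the MIB data).
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, random_state=1)

# 75% / 25% stratified split with random_state=1.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

# Min-max scaling to [0, 1], fitted on the training portion only (assumption).
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# First-level learners with default parameters; only random_state is fixed.
base_learners = [
    ("rf", RandomForestClassifier(random_state=1)),
    ("svm", SVC()),
    ("nb", GaussianNB()),  # assumed variant; the text only says "naive Bayes"
]

# Logistic regression as the meta classifier, trained on 10-fold
# cross-validated predictions of the first-level learners.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(),
                           cv=10)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
```

Because the data here are synthetic, the resulting accuracy says nothing about the results reported in this paper; the sketch only shows how the pieces fit together.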


2.6. Evaluation and comparison 

A confusion matrix is used as the main source of assessment in this research to evaluate the fake
account detection models. A confusion matrix is a technique used to describe the performance of
classification algorithms, and it gives a better understanding of the classification model and the types of
errors it makes. The confusion matrix of the stack is evaluated and compared with those of the algorithms
that were used with it. Table 1 explains the confusion matrix in more detail. All the results of the algorithms
are plotted in a confusion matrix to determine where the errors occurred.
- Accuracy: the number of correctly classified instances in both classes over the total number of
  instances in the dataset.

$$\text{Accuracy rate} = \frac{TP + TN}{TP + FN + FP + TN} \qquad (5)$$

- Precision: the proportion of correct positive predictions out of the total number of positive predictions.

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (6)$$

- Recall: the ratio of correct positive predictions to the total number of positive examples in the test set.

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (7)$$

- F_measure: a measure of model efficiency, the harmonic mean of precision and recall.

$$\text{F\_measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (8)$$


Table 1. Confusion matrix

                  Predicted class
Actual class      Positive                                    Negative
Positive          TP (correctly classified as positives)      FN (incorrectly classified as negatives)
Negative          FP (incorrectly classified as positives)    TN (correctly classified as negatives)

3. RESULTS AND DISCUSSION 
3.1. Pre-processing of dataset 

This stage includes the results of data cleaning, followed by the feature extraction and selection
methods and the normalization method. In data cleaning, the columns containing 30% or more empty fields
were deleted, reducing the features from 29 to 23. Feature extraction and selection involved two
filter methods. First, the Spearman correlation coefficient was used to find the correlation between all the
features, which reduced the number of features in the dataset to 7. Table 2 presents the result of Spearman's
correlation coefficient.

Then, the chi-square test was applied to Spearman's output, which reduced the number of features
to five: Statuses_count, Followers_count, Friends_count, Favourites_count, and Listed_count. The values of
these features are shown in Figure 5. Figure 5(a) shows Statuses_count, which is the total number of tweets
sent by the account. Figure 5(b) shows Followers_count, the number of followers who follow the user.
Figure 5(c) shows Friends_count, the number of accounts that the account holder follows. Figure 5(d) shows
Favourites_count, the number of tweets that the account has liked over the course of its existence, and
Figure 5(e) shows Listed_count, the number of people who have added the user to their list. The last stage of
data preprocessing is applying Min-Max normalization to put all the data values on one scale. All data values
were converted into values between zero and one.
Table 2. The result of Spearman rank correlation

                  Statuses   Followers  Friends    Favorites  Listed     Default    Background Dataset
                  Count      Count      Count      Count      Count      Profile    Image      (output)
Statuses Count    1          0.84432    0.1598     0.7461     0.60878    0.00815    -0.0002    -0.7338
Followers Count   0.84432    1          0.233297   0.67047    0.63751    -0.0261    -0.00568   -0.6914
Friends Count     0.1598     0.233297   1          0.03005    0.15653    -0.2801    -0.04419   0.21315
Favorites Count   0.7461     0.67047    0.03005    1          0.60688    -0.1314    -0.0067    -0.7081
Listed Count      0.60878    0.63751    0.15653    0.60688    1          -0.1088    -0.0018    -0.5987
Default Profile   0.00815    -0.0261    -0.2801    -0.1314    -0.1088    1          0.16169    -0.1207
Background Image  -0.0002    -0.00568   -0.04419   -0.0067    -0.0018    0.16169    1          -0.0310
Dataset (output)  -0.7338    -0.6914    0.21315    -0.7081    -0.5987    -0.1207    -0.0310    1
Figure 5. Values of the selected account features (a) statuses_count, (b) followers_count, (c) friends_count, 
(d) favourites_count, and (e) listed_count 
3.2. Detection of fake accounts 

This stage presents the evaluation results derived from the confusion matrices of the stack method
and the algorithms that were used with it. As shown in Table 3, based on the entire set of selected features,
the stack method achieved higher accuracy, precision, and F1_score than each algorithm used separately.
The accuracy of the stack method was 99%, the precision was 99%, and the F1_score
was 99.2%. Figure 6 visualizes the comparison between the stack method and the individual
algorithms: Figure 6(a) shows the accuracy, Figure 6(b) the F1_score, Figure 6(c) the recall, and
Figure 6(d) the precision.


Table 3. A comparison between the stack system and the algorithms used

Name                 Accuracy  F1_score  Recall    Precision
Stack                0.990196  0.992257  0.994033  0.990488
Random forest        0.988688  0.991055  0.991647  0.990465
SVM                  0.913273  0.934992  0.986874  0.888292
Naive Bayes          0.763952  0.841839  0.994033  0.730061
Logistic regression  0.674962  0.795444  1         0.660362
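
As a consistency check, the F1 scores in Table 3 can be recomputed from the reported precision and recall values using the F-measure definition. The sketch below copies the values from the table; the variable names are hypothetical.

```python
# (F1_score, recall, precision) as reported in Table 3.
table3 = {
    "Stack": (0.992257, 0.994033, 0.990488),
    "Random forest": (0.991055, 0.991647, 0.990465),
    "SVM": (0.934992, 0.986874, 0.888292),
    "Naive Bayes": (0.841839, 0.994033, 0.730061),
    "Logistic regression": (0.795444, 1.0, 0.660362),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Largest absolute difference between the reported and the recomputed F1 score.
max_gap = max(abs(f1 - f1_from(p, r)) for f1, r, p in table3.values())
best_f1_model = max(table3, key=lambda name: table3[name][0])
```

The recomputed values should agree with the reported ones up to the rounding of the table entries, and the stack has the highest reported F1 score.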
Figure 6. Visualization of the confusion matrix between the stack method and the algorithms: (a) accuracy, 
(b) F1_score, (c) recall, (d) precision 


4. CONCLUSION 

In this paper, we proposed a model for fake account detection based on a collection of basic Twitter
features that are publicly accessible. These feature sets were derived from the profile details available in the
tweets of users. To improve the detection model, the stack ensemble method based on four machine learning
algorithms was used, and two feature selection methods were implemented to determine which features were
most influential in the detection process. Initial results show that successful results can be achieved with a
stack ensemble method by using random forest, SVM, and naive Bayes classification algorithms as base-level
classifiers and logistic regression as a meta classifier. With this methodology, the classification
accuracy reached 99%. The results also revealed that the ensemble system achieves higher accuracy in the
detection process than each algorithm used separately. For future work, much larger datasets could be
collected and the same methodology applied with other machine
learning algorithms.